Credit Card Users Churn Prediction : Problem Statement

Description

Background & Context

Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave its credit card services, and understand the reasons why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance.

Objective

  • Explore and visualize the dataset.
  • Build a classification model to predict whether a customer is going to churn.
  • Optimize the model using appropriate techniques.
  • Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  • Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
  • Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
  • Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Let's import relevant libraries

In [1]:
# To help with reading and manipulation of data
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)  # Remove the limit on the number of displayed columns
pd.set_option("display.max_rows", 100)  # Limit the number of displayed rows to 100


# To help with data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)

#To split the data
from sklearn.model_selection import train_test_split

#To impute missing values
from sklearn.impute import KNNImputer

#To build the required models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier


#To tune a model
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

#To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


#To get different model performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix,accuracy_score, recall_score,precision_score,f1_score

#To create pipeline
from sklearn.pipeline import make_pipeline

#To use standard scaler
from sklearn.preprocessing import StandardScaler

#To suppress warnings
import warnings
warnings.filterwarnings('ignore')

#to make the codes well structured automatically
%load_ext nb_black

Let's load the data to work on

In [2]:
churn_data = pd.read_csv("BankChurners.csv")
churn_data.head(20)  # to check the first 20 rows of the dataset
Out[2]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000
5 713061558 Existing Customer 44 M 2 Graduate Married $40K - $60K Blue 36 3 1 2 4010.0 1247 2763.0 1.376 1088 24 0.846 0.311
6 810347208 Existing Customer 51 M 4 NaN Married $120K + Gold 46 6 1 3 34516.0 2264 32252.0 1.975 1330 31 0.722 0.066
7 818906208 Existing Customer 32 M 0 High School NaN $60K - $80K Silver 27 2 2 2 29081.0 1396 27685.0 2.204 1538 36 0.714 0.048
8 710930508 Existing Customer 37 M 3 Uneducated Single $60K - $80K Blue 36 5 2 0 22352.0 2517 19835.0 3.355 1350 24 1.182 0.113
9 719661558 Existing Customer 48 M 2 Graduate Single $80K - $120K Blue 36 6 3 3 11656.0 1677 9979.0 1.524 1441 32 0.882 0.144
10 708790833 Existing Customer 42 M 5 Uneducated NaN $120K + Blue 31 5 3 2 6748.0 1467 5281.0 0.831 1201 42 0.680 0.217
11 710821833 Existing Customer 65 M 1 NaN Married $40K - $60K Blue 54 6 2 3 9095.0 1587 7508.0 1.433 1314 26 1.364 0.174
12 710599683 Existing Customer 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.0 0 11751.0 3.397 1539 17 3.250 0.000
13 816082233 Existing Customer 35 M 3 Graduate NaN $60K - $80K Blue 30 5 1 3 8547.0 1666 6881.0 1.163 1311 33 2.000 0.195
14 712396908 Existing Customer 57 F 2 Graduate Married Less than $40K Blue 48 5 2 2 2436.0 680 1756.0 1.190 1570 29 0.611 0.279
15 714885258 Existing Customer 44 M 4 NaN NaN $80K - $120K Blue 37 5 1 2 4234.0 972 3262.0 1.707 1348 27 1.700 0.230
16 709967358 Existing Customer 48 M 4 Post-Graduate Single $80K - $120K Blue 36 6 2 3 30367.0 2362 28005.0 1.708 1671 27 0.929 0.078
17 753327333 Existing Customer 41 M 3 NaN Married $80K - $120K Blue 34 4 4 1 13535.0 1291 12244.0 0.653 1028 21 1.625 0.095
18 806160108 Existing Customer 61 M 1 High School Married $40K - $60K Blue 56 2 2 3 3193.0 2517 676.0 1.831 1336 30 1.143 0.788
19 709327383 Existing Customer 45 F 2 Graduate Married abc Blue 37 6 1 2 14470.0 1157 13313.0 0.966 1207 21 0.909 0.080
  • The dataset looks consistent with the description in the Data Dictionary. However, the CLIENTNUM column carries no information relevant to the project objectives, so it will be dropped shortly.
In [3]:
df = churn_data.copy()  # To save a copy of the original data
In [4]:
df.shape  # To check the number of rows and columns in the data set
Out[4]:
(10127, 21)
  • The dataset has 10127 rows and 21 columns
In [5]:
df.duplicated().sum()  # To check for duplicated rows in the data set
Out[5]:
0
  • There are no duplicated rows in the dataset
In [6]:
df.info()  # To print the concise summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
  • It can further be seen that there are a total of 21 columns and 10127 rows in the dataset.
  • The columns' data types are a mix of integer, float, and object types.
  • The number of non-null values in Education_Level and Marital_Status is lower than the total number of rows in the dataset. This means those features contain missing observations. We can further confirm this using the isnull() method.
  • It is also worth mentioning that Attrition_Flag is our target feature based on the problem statement above, while the other features are the independent variables (predictors)
In [7]:
df.isnull().sum()  # To further confirm if there are missing values in the dataset or not
Out[7]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
  • As observed earlier, Education_Level and Marital_Status contain missing observations
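Absolute counts of missing values can be turned into percentages with `isnull().mean()`, which helps judge whether imputation (e.g. the KNNImputer imported earlier) or an explicit 'Unknown' category is more appropriate. A minimal sketch on a toy frame (the column names mirror the real dataset, but the values here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two columns that contain missing values (values illustrative)
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "High School", np.nan],
        "Marital_Status": ["Married", "Single", np.nan, "Single"],
    }
)

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

On the real data this would report roughly 15% missing for Education_Level and 7.4% for Marital_Status.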
In [8]:
df.drop(
    "CLIENTNUM", axis=1, inplace=True
)  # To drop the feature that is irrelevant to the project objective
In [9]:
df.columns  # To check the names of the features on the dataset
Out[9]:
Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
       'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
  • As noted earlier, from the project objectives it can be observed that our target feature is 'Attrition_Flag' while the others are independent features

Summary of the dataset

In [10]:
df.describe().T  # To display the statistical summary of the numerical features
Out[10]:
count mean std min 25% 50% 75% max
Customer_Age 10127.0 46.325960 8.016814 26.0 41.000 46.000 52.000 73.000
Dependent_count 10127.0 2.346203 1.298908 0.0 1.000 2.000 3.000 5.000
Months_on_book 10127.0 35.928409 7.986416 13.0 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.0 3.812580 1.554408 1.0 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.0 2.341167 1.010622 0.0 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.0 2.455317 1.106225 0.0 2.000 2.000 3.000 6.000
Credit_Limit 10127.0 8631.953698 9088.776650 1438.3 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.0 1162.814061 814.987335 0.0 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.0 7469.139637 9090.685324 3.0 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.0 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 4404.086304 3397.129254 510.0 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.0 64.858695 23.472570 10.0 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.0 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999
  • The mean, standard deviation, and median (50th percentile) of all numerical features are displayed above.
  • Worthy of mention is that the mean Customer_Age is 46.3, while the oldest customer is 73.
  • Credit_Limit has a mean, median, and maximum of 8631.95, 4549.00, and 34516.00 respectively. This suggests a right-skewed distribution. The same pattern can be observed in Avg_Open_To_Buy and Total_Trans_Amt.
  • Total_Revolving_Bal has a mean, median, and maximum of 1162.81, 1276.00, and 2517.00 respectively. This suggests a left-skewed distribution. The same pattern can be observed in Total_Trans_Ct.
  • We shall explore these distributions further shortly.
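The skew directions read off from the mean-median comparison above can also be confirmed numerically with pandas' `.skew()` (positive values indicate right skew, negative values left skew). A minimal sketch on synthetic data (the column names mirror the real frame, but the values here are illustrative only):

```python
import pandas as pd

# Toy stand-ins for two columns of the real dataset (values are illustrative only)
toy = pd.DataFrame(
    {
        "Credit_Limit": [1500, 2000, 2500, 3000, 4500, 9000, 20000, 34000],  # long right tail
        "Total_Revolving_Bal": [0, 900, 1500, 1900, 2100, 2300, 2450, 2517],  # long left tail
    }
)

skews = toy.skew()
print(skews)

# Positive skewness -> right-skewed; negative -> left-skewed
right_skewed = skews["Credit_Limit"] > 0
left_skewed = skews["Total_Revolving_Bal"] < 0
```

On the real frame, `df.skew()` would give the same one-line check for every numerical column at once.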

Let's look at the unique values of all the categorical features

In [11]:
# To display the unique values in each categorical feature
col_cats = df.select_dtypes(["object"])
for i in col_cats.columns:
    print("Unique values in", i, "are:")
    print(col_cats[i].value_counts())
    print("*" * 50)
Unique values in Attrition_Flag are:
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
**************************************************
Unique values in Gender are:
F    5358
M    4769
Name: Gender, dtype: int64
**************************************************
Unique values in Education_Level are:
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
**************************************************
Unique values in Marital_Status are:
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
**************************************************
Unique values in Income_Category are:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
**************************************************
Unique values in Card_Category are:
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
**************************************************
  • It can be observed that a data collection or typo error occurred in Income_Category, with 'abc' appearing as another income class. This will now be treated as 'Unknown'.
In [12]:
df["Income_Category"] = df["Income_Category"].replace(
    "abc", "Unknown"
)  # To replace abc in Income_Category with 'Unknown'
In [13]:
df[
    "Income_Category"
].value_counts()  # To check the new unique categories in the Income_Category
Out[13]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64
  • This confirms that the replacement of 'abc' with 'Unknown' worked as expected

Univariate Analysis

In [14]:
# Function to create both boxplot and histogram that will contain both mean and median values of each feature
def hist_box_plt(feature, figsize=(15, 10), bins=None):
    sns.set(font_scale=2)
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(feature, ax=ax_box2, showmeans=True, color="g")
    sns.distplot(feature, kde=True, ax=ax_hist2, bins=bins)  # bins=None falls back to the default binning
    ax_box2.axvline(np.mean(feature), color="red", linestyle="--")
    ax_box2.axvline(np.median(feature), color="black", linestyle="-")
    ax_hist2.axvline(np.mean(feature), color="red", linestyle="--")
    ax_hist2.axvline(np.median(feature), color="black", linestyle="-")


# Function to create barplots that indicate percentage for each category
def perc_on_bar(feature):
    total = len(feature)  # length of the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(feature, palette="bright")
    for p in ax.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # To show percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x-coordinate for the annotation
        y = p.get_y() + p.get_height()  # y-coordinate (top of the bar)
        ax.annotate(percentage, (x, y), size=15)  # To annotate the percentage
    plt.show()  # To show the plot

Attrition_Flag

In [15]:
perc_on_bar(df.Attrition_Flag)  # To create the barplot of Attrition_Flag
  • It is observed that 83.9% of the bank's customers are still existing credit card customers while 16.1% have attrited
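This class split can be computed directly with `value_counts(normalize=True)`; the roughly 84/16 imbalance is worth keeping in mind when choosing metrics and resampling strategies (SMOTE and random undersampling were imported earlier for exactly this reason). A sketch on a toy series with the same labels (the counts below are illustrative, chosen to match the observed ratio):

```python
import pandas as pd

# Toy target column with the same labels as Attrition_Flag (counts are illustrative)
flag = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)

# Normalized value counts give each class's share of the total
shares = flag.value_counts(normalize=True)
print(shares)

majority_share = shares["Existing Customer"]
minority_share = shares["Attrited Customer"]
```

On the real data, `df["Attrition_Flag"].value_counts(normalize=True)` yields the 83.9% / 16.1% split shown in the bar plot.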

Customer_Age

In [16]:
hist_box_plt(df.Customer_Age)  # To plot the boxplot and histogram of Customer_Age
  • It is observed that the distribution of Customer_Age is close to normal. However, there are a few outliers to the right. Also, 75% of the customers are below 55 years of age

Gender

In [17]:
perc_on_bar(df.Gender)  # To plot the barplot of Gender
  • It can be observed that there are more female credit card customers than males in the bank

Dependent_count

In [18]:
perc_on_bar(df.Dependent_count)  # To plot the barplot of Dependent_count
  • It is observed that 75% of the customers have no more than 3 dependents. While 27% have 3 dependents, only 4.2% have 5 dependents

Education_Level

In [19]:
perc_on_bar(df.Education_Level)  # To plot the barplot of Education_Level
  • It can be observed that Graduates constitute the highest percentage (30.9%) of the credit card clients, while clients with a Doctorate degree are the fewest, constituting only 4.5%

Marital_Status

In [20]:
perc_on_bar(df.Marital_Status)  # To plot the barplot of Marital_Status
  • It is observed that the highest percentage of clients are Married (46.3%), followed by those who are Single, while Divorced clients constitute only 7.4%

Income_Category

In [21]:
perc_on_bar(df.Income_Category)  # To plot the barplot of Income_Category
  • It can be observed that clients earning 'Less than $40K' form the highest percentage (35.2%), while those earning '$120K +' are the fewest, constituting only 7.2% of the credit card clients

Card_Category

In [22]:
perc_on_bar(df.Card_Category)  # To plot the barplot of Card_Category
  • It is observed that Blue is the dominant credit card category among the clients, constituting 93.2% of the credit cards issued. This suggests that the Blue card is the flagship card of the bank, enjoying the highest patronage. The Platinum card is the least patronised, constituting only 0.2% of the card sales

Months_on_book

In [23]:
hist_box_plt(df.Months_on_book)  # To plot the boxplot and histplot of Months_on_book
  • It can be observed that 75% of the clients have been with the bank for no more than 40 months. Months_on_book resembles a normal distribution, with its mean and median almost equal, but the distribution has outliers on both the right and left sides.

Total_Relationship_Count

In [24]:
perc_on_bar(
    df.Total_Relationship_Count
)  # To plot the barplot of Total_Relationship_Count
  • It can be observed that clients holding 3 bank products constitute the highest percentage (22.8%). There appears to be no significant difference in the percentages of clients with 4, 5, and 6 bank products. However, clients holding just 1 product constitute the smallest percentage (9.0%)

Months_Inactive_12_mon

In [25]:
perc_on_bar(df.Months_Inactive_12_mon)  # To plot the barplot of Months_Inactive_12_mon
  • Customers that were inactive for 3 months in the last 12 months are the most numerous, constituting 38% of the total clients.
  • Customers that were active throughout the last 12 months constitute the smallest percentage, 0.3%
  • Worth mentioning is that more than 90% of the credit card clients were inactive for 1-3 months in the last 12 months

Contacts_Count_12_mon

In [26]:
perc_on_bar(df.Contacts_Count_12_mon)  # To plot the barplot of Contacts_Count_12_mon
  • Customers who had 3 contacts with the bank in the last 12 months constitute the highest percentage (33.4%)
  • Customers who had 6 contacts with the bank in the last 12 months constitute the smallest percentage, 0.5%
  • About 80% of the customers have had 1-3 contacts with the bank in the last 12 months

Credit_Limit

In [27]:
hist_box_plt(df.Credit_Limit)  # To plot the boxplot and histplot of the Credit_Limit
  • 75% of the clients are given less than 12,000 as their Credit_Limit. The distribution of Credit_Limit is highly right-skewed, with many outliers to the right

Total_Revolving_Bal

In [28]:
hist_box_plt(
    df.Total_Revolving_Bal
)  # To plot the boxplot and histplot of the Total_Revolving_Bal
  • The spread of Total_Revolving_Bal among the clients is also skewed, with 75% of the clients having less than 1,800 as their Total_Revolving_Bal

Avg_Open_To_Buy

In [29]:
hist_box_plt(
    df.Avg_Open_To_Buy
)  # To plot the boxplot and histplot of the Avg_Open_To_Buy
  • 75% of the clients have less than 10,000 on average as the amount left on their credit cards in the last 12 months. The distribution of Avg_Open_To_Buy is highly right-skewed, with many outliers to the right.

Total_Amt_Chng_Q4_Q1

In [30]:
hist_box_plt(
    df.Total_Amt_Chng_Q4_Q1
)  # To plot the boxplot and histplot of the Total_Amt_Chng_Q4_Q1
  • It is observed that the distribution of Total_Amt_Chng_Q4_Q1 is skewed to both sides, though more to the right, with many outliers on both sides. More than 75% of the clients have a lower total transaction amount in Q4 than in Q1, as their ratio is less than 1

Total_Trans_Amt

In [31]:
hist_box_plt(
    df.Total_Trans_Amt
)  # To plot the boxplot and histplot of the Total_Trans_Amt
  • It is observed that more than 75% of the clients have total transaction amounts of less than 5,000 in the last 12 months. Also, the distribution of Total_Trans_Amt is right-skewed with many outliers to the right

Total_Trans_Ct

In [32]:
hist_box_plt(
    df.Total_Trans_Ct
)  # To plot the boxplot and histplot of the Total_Trans_Ct
  • It is observed that the distribution of Total_Trans_Ct is slightly right-skewed with few outliers to the right. Also, most of the clients have their total transaction counts in the last 12 months to be less than 100.

Total_Ct_Chng_Q4_Q1

In [33]:
hist_box_plt(
    df.Total_Ct_Chng_Q4_Q1
)  # To plot the boxplot and histplot of the Total_Ct_Chng_Q4_Q1
  • It is observed that the distribution of Total_Ct_Chng_Q4_Q1 looks roughly normal but is skewed to both sides, more so to the right, with many outliers on both sides. More than 75% of the clients have a lower total transaction count in Q4 than in Q1, as their ratio is less than 1

Avg_Utilization_Ratio

In [34]:
hist_box_plt(
    df.Avg_Utilization_Ratio
)  # To plot the boxplot and histplot of the Avg_Utilization_Ratio
  • Most of the clients spend less than their available credit limit. The Avg_Utilization_Ratio distribution is right-skewed.

Bivariate Analysis

In [35]:
# We shall plot the heatmap of the correlation ratio among the numerical data
plt.figure(figsize=(25, 15))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • The heatmap above shows some strong positive relationships: between Customer_Age and Months_on_book, Total_Trans_Amt and Total_Trans_Ct, and Total_Revolving_Bal and Avg_Utilization_Ratio. Worthy of note is also a perfect positive correlation between Credit_Limit and Avg_Open_To_Buy, meaning an increase in one is matched by an increase in the other and vice versa.
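The perfect Credit_Limit / Avg_Open_To_Buy correlation is consistent with the data dictionary: open-to-buy is the unused part of the limit, so one column is (roughly) a linear function of the other. A hedged check on the first three rows displayed in `df.head()` above, assuming the relation Avg_Open_To_Buy = Credit_Limit − Total_Revolving_Bal holds row by row (which these rows support):

```python
import pandas as pd

# First three rows as displayed in df.head() earlier in the notebook
sample = pd.DataFrame(
    {
        "Credit_Limit": [12691.0, 8256.0, 3418.0],
        "Total_Revolving_Bal": [777, 864, 0],
        "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0],
    }
)

# Open to buy should equal credit limit minus the revolving balance
derived = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
identity_holds = bool((derived == sample["Avg_Open_To_Buy"]).all())
```

If this identity holds across the full frame, one of the two columns is redundant for modelling, which is why such a near-1.0 correlation is worth flagging.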
In [36]:
sns.pairplot(df)  # To plot the pairplot of all the pairs of the numerical features
plt.show()
  • We can see varying distributions among the variables which need further investigations.
In [37]:
# Function to plot stacked bar charts for Attrition_Flag against other variables
def stacked_plt(x):
    sns.set()
    # Crosstab
    tab_ = pd.crosstab(x, df["Attrition_Flag"], margins=True).sort_values(
        by="Attrited Customer", ascending=False
    )
    print(tab_)
    print("-" * 120)
    # Visualising the crosstab
    tab = pd.crosstab(x, df["Attrition_Flag"], normalize="index").sort_values(
        by="Attrited Customer", ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(17, 7))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()

Attrition_Flag Vs Customer_Age

In [38]:
stacked_plt(
    df.Customer_Age
)  # To plot stacked barplot of Attrition_Flag and Customer_Age
Attrition_Flag  Attrited Customer  Existing Customer    All
Customer_Age                                               
All                          1627               8500  10127
43                             85                388    473
48                             85                387    472
44                             84                416    500
46                             82                408    490
45                             79                407    486
49                             79                416    495
47                             76                403    479
41                             76                303    379
50                             71                381    452
54                             69                238    307
40                             64                297    361
42                             62                364    426
53                             59                328    387
52                             58                318    376
51                             58                340    398
55                             51                228    279
39                             48                285    333
38                             47                256    303
56                             43                219    262
59                             40                117    157
37                             37                223    260
57                             33                190    223
58                             24                133    157
36                             24                197    221
35                             21                163    184
33                             20                107    127
34                             19                127    146
32                             17                 89    106
61                             17                 76     93
62                             17                 76     93
30                             15                 55     70
31                             13                 78     91
60                             13                114    127
65                              9                 92    101
63                              8                 57     65
29                              7                 49     56
26                              6                 72     78
64                              5                 38     43
27                              3                 29     32
28                              1                 28     29
66                              1                  1      2
68                              1                  1      2
67                              0                  4      4
70                              0                  1      1
73                              0                  1      1
------------------------------------------------------------------------------------------------------------------------
  • It can be observed that customers aged 66 and 68 have the highest likelihood (50%) of attriting, though with only 2 customers at each of those ages the figures are not reliable. However, customers within the 40-55 age bracket attrited the most in the last 12 months, constituting approximately 70% of the credit card customers that attrited
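Rather than eyeballing the per-age crosstab, the ages can be bucketed with `pd.cut` and the attrition rate computed per bracket. A minimal sketch on a toy sample (the ages and flags below are illustrative, not drawn from the real data; the bracket edges are an assumption):

```python
import pandas as pd

# Toy sample of (age, attrited?) pairs -- values are illustrative only
toy = pd.DataFrame(
    {
        "Customer_Age": [30, 35, 42, 45, 50, 55, 60, 67],
        "Attrited": [0, 0, 1, 1, 1, 0, 0, 1],
    }
)

# Bucket ages into brackets, then compute the attrition rate per bracket
toy["Age_Bracket"] = pd.cut(
    toy["Customer_Age"], bins=[25, 40, 55, 75], labels=["26-40", "41-55", "56-75"]
)
rates = toy.groupby("Age_Bracket")["Attrited"].mean()
print(rates)
```

On the real frame the same two lines (with `Attrition_Flag` mapped to 0/1) would condense the long table above into a handful of bracket-level rates.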

Attrition_Flag Vs Gender

In [39]:
stacked_plt(df.Gender)  # To plot the stacked plot of Attrition_Flag and Gender
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------------------------------------------------------------------
  • It can be observed that although female clients have a slightly higher likelihood of attriting, there is no significant difference in attrition likelihood between male and female clients. In terms of absolute numbers, more female clients have attrited than male clients

Attrition_Flag Vs Dependent_count

In [40]:
stacked_plt(
    df.Dependent_count
)  # To plot the stacked plot of Attrition_Flag and Dependent_count
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count                                             
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424
------------------------------------------------------------------------------------------------------------------------
  • There appears to be no significant difference in the likelihood of attrition based on the clients' dependent counts. However, about 55% of the credit card clients that attrited have 2-3 dependents

Attrition_Flag Vs Education_Level

In [41]:
stacked_plt(
    df.Education_Level
)  # To plot the stacked plot of Attrition_Flag and Education_Level
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level                                            
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------
  • Among educated customers, it appears that the higher the level of education, the higher the likelihood of attriting; only customers with a Doctorate degree have an above-20% probability of attriting. However, worthy of mention is that in absolute numbers, Graduate customers have attrited the most.

Attrition_Flag Vs Marital_Status

In [42]:
stacked_plt(
    df.Marital_Status
)  # To plot the stacked plot of Attrition_Flag and Marital_Status
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status                                            
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------------------------------------------------------
  • There is no significant difference in the likelihood of attrition across Marital_Status groups; all are below 20%. In absolute numbers, however, married customers attrited the most.

Attrition_Flag Vs Income_Category

In [43]:
stacked_plt(
    df.Income_Category
)  # To plot the stacked plot of Attrition_Flag and Income_Category
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
Unknown                        187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------------------------------------------------------
  • There is no significant difference in the likelihood of attrition across Income_Category classes. In absolute numbers, however, customers earning less than $40K attrited the most.

Attrition_Flag Vs Card_Category

In [44]:
stacked_plt(
    df.Card_Category
)  # To plot the stacked plot of Attrition_Flag and Card_Category
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category                                              
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------------------------------------------------------
  • The plot shows that Platinum cardholders have the highest attrition rate: one in four attrited. That said, the Platinum segment holds only 20 customers, so this rate should be read with caution. Silver cardholders have the lowest attrition rate, while Blue cardholders attrited the most in absolute numbers.

Attrition_Flag Vs Months_on_book

In [45]:
stacked_plt(
    df.Months_on_book
)  # To plot the stacked plot of Attrition_Flag and Months_on_book
Attrition_Flag  Attrited Customer  Existing Customer    All
Months_on_book                                             
All                          1627               8500  10127
36                            430               2033   2463
39                             64                277    341
37                             62                296    358
30                             58                242    300
38                             57                290    347
34                             57                296    353
41                             51                246    297
33                             48                257    305
40                             45                288    333
35                             45                272    317
32                             44                245    289
28                             43                232    275
44                             42                188    230
43                             42                231    273
46                             36                161    197
42                             36                235    271
29                             34                207    241
31                             34                284    318
45                             33                194    227
25                             31                134    165
24                             28                132    160
48                             27                135    162
50                             25                 71     96
49                             24                117    141
26                             24                162    186
47                             24                147    171
27                             23                183    206
22                             20                 85    105
56                             17                 86    103
51                             16                 64     80
18                             13                 45     58
20                             13                 61     74
52                             12                 50     62
23                             12                104    116
21                             10                 73     83
15                              9                 25     34
53                              7                 71     78
13                              7                 63     70
19                              6                 57     63
54                              6                 47     53
17                              4                 35     39
55                              4                 38     42
16                              3                 26     29
14                              1                 15     16
------------------------------------------------------------------------------------------------------------------------
  • Customers who have spent 15, 50, 18, or 51 months with the bank show at least a 20% likelihood of attriting, while the rest are below 20%. In absolute numbers, customers at 36 months on book attrited the most. The bank is therefore advised to focus on customers who have been with it for 35-40 months, as they account for the largest share of attrition.

Attrition_Flag Vs Total_Relationship_Count

In [46]:
stacked_plt(
    df.Total_Relationship_Count
)  # To plot the stacked plot of Attrition_Flag and Total_Relationship_Count
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count                                             
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
------------------------------------------------------------------------------------------------------------------------
  • Customers holding only one or two of the bank's products have a markedly higher likelihood of attriting than the rest. The bank should therefore step up its cross-selling efforts so that each customer holds at least four products, reducing the chance of attrition.

Attrition_Flag Vs Months_Inactive_12_mon

In [47]:
stacked_plt(
    df.Months_Inactive_12_mon
)  # To plot the stacked plot of Attrition_Flag and Months_Inactive_12_mon
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon                                             
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
------------------------------------------------------------------------------------------------------------------------
  • Customers with zero inactive months in the last year show an over-50% attrition rate, though this group is tiny (29 customers). In absolute numbers, customers inactive for 2-3 months attrited the most.

Attrition_Flag Vs Contacts_Count_12_mon

In [48]:
stacked_plt(
    df.Contacts_Count_12_mon
)  # To plot the stacked plot of Attrition_Flag and Contacts_Count_12_mon
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon                                             
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
------------------------------------------------------------------------------------------------------------------------
  • The plot makes it clear that attrition rises steadily with the number of contacts a customer has had with the bank in the last 12 months: every one of the 54 customers with six contacts attrited. Frequent contact likely signals unresolved complaints, so the bank should treat a rising contact count as an early churn warning and prioritise resolving those customers' issues.
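The per-level attrition rates quoted in these observations can be computed directly with a row-normalised crosstab. A small sketch on toy data — the column names match the notebook's, the values do not:

```python
import pandas as pd

# toy frame standing in for df (the real data has 10127 rows)
toy = pd.DataFrame({
    "Contacts_Count_12_mon": [0, 0, 6, 6, 3, 3, 3, 3],
    "Attrition_Flag": ["Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer",
                       "Attrited Customer", "Existing Customer",
                       "Existing Customer", "Existing Customer"],
})

# normalize="index" turns each row into proportions, so the
# "Attrited Customer" column is the attrition rate per contact count
rates = pd.crosstab(
    toy["Contacts_Count_12_mon"],
    toy["Attrition_Flag"],
    normalize="index",
)["Attrited Customer"]
print(rates)
```

Applied to the real `df`, this yields the 100%-attrition figure for six contacts and the near-zero rate for zero contacts that the observation above relies on.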

Attrition_Flag Vs Credit_Limit

In [49]:
sns.boxplot(df["Credit_Limit"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Credit_Limit
Out[49]:
<AxesSubplot:xlabel='Credit_Limit', ylabel='Attrition_Flag'>
  • It can be observed that the credit limit of existing customers is slightly higher than that of the attrited customers

Attrition_Flag Vs Total_Revolving_Bal

In [50]:
sns.boxplot(df["Total_Revolving_Bal"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Revolving_Bal
Out[50]:
<AxesSubplot:xlabel='Total_Revolving_Bal', ylabel='Attrition_Flag'>
  • It can be observed that the Existing customers generally carry over higher amounts from one month to the other than the Attrited customers.

Attrition_Flag Vs Avg_Open_To_Buy

In [51]:
sns.boxplot(df["Avg_Open_To_Buy"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Avg_Open_To_Buy
Out[51]:
<AxesSubplot:xlabel='Avg_Open_To_Buy', ylabel='Attrition_Flag'>
  • Most attrited customers had a slightly lower average open-to-buy (unused credit) over the last 12 months than existing customers.

Attrition_Flag Vs Total_Amt_Chng_Q4_Q1

In [52]:
sns.boxplot(df["Total_Amt_Chng_Q4_Q1"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Amt_Chng_Q4_Q1
Out[52]:
<AxesSubplot:xlabel='Total_Amt_Chng_Q4_Q1', ylabel='Attrition_Flag'>
  • It is observed that the changes in transaction amounts between Q4 and Q1 in the last 12 months for most of the Attrited customers are generally lower than those of existing customers

Attrition_Flag Vs Total_Trans_Amt

In [53]:
sns.boxplot(df["Total_Trans_Amt"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Trans_Amt
Out[53]:
<AxesSubplot:xlabel='Total_Trans_Amt', ylabel='Attrition_Flag'>
  • Most of the Attrited customers have lower total transaction amounts in the last 12 months when compared with the existing customers. This should serve as an early warning signal of attrition for the bank and should be monitored closely

Attrition_Flag Vs Total_Trans_Ct

In [54]:
sns.boxplot(df["Total_Trans_Ct"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Trans_Ct
Out[54]:
<AxesSubplot:xlabel='Total_Trans_Ct', ylabel='Attrition_Flag'>
  • As expected given the pattern in Total_Trans_Amt, most attrited customers also have lower total transaction counts in the last 12 months than existing customers.

Attrition_Flag Vs Total_Ct_Chng_Q4_Q1

In [55]:
sns.boxplot(df["Total_Ct_Chng_Q4_Q1"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Ct_Chng_Q4_Q1
Out[55]:
<AxesSubplot:xlabel='Total_Ct_Chng_Q4_Q1', ylabel='Attrition_Flag'>
  • It is observed that the changes in transaction counts between Q4 and Q1 in the last 12 months for most of the Attrited customers are generally lower than those of existing customers

Attrition_Flag Vs Avg_Utilization_Ratio

In [56]:
sns.boxplot(df["Avg_Utilization_Ratio"], df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Avg_Utilization_Ratio
Out[56]:
<AxesSubplot:xlabel='Avg_Utilization_Ratio', ylabel='Attrition_Flag'>
  • Attrited customers clearly used a smaller share of their available credit than existing customers.

Characteristics of Attrited Customers

In [57]:
df[df["Attrition_Flag"] == "Attrited Customer"].describe(
    include="all"
).T  # To show the statistical summary of Attrited Customers only along all features
Out[57]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 1627 1 Attrited Customer 1627 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 1627.0 NaN NaN NaN 46.659496 7.665652 26.0 41.0 47.0 52.0 68.0
Gender 1627 2 F 930 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 1627.0 NaN NaN NaN 2.402581 1.27501 0.0 2.0 2.0 3.0 5.0
Education_Level 1371 6 Graduate 487 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 1498 3 Married 709 NaN NaN NaN NaN NaN NaN NaN
Income_Category 1627 6 Less than $40K 612 NaN NaN NaN NaN NaN NaN NaN
Card_Category 1627 4 Blue 1519 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 1627.0 NaN NaN NaN 36.178242 7.796548 13.0 32.0 36.0 40.0 56.0
Total_Relationship_Count 1627.0 NaN NaN NaN 3.279656 1.577782 1.0 2.0 3.0 5.0 6.0
Months_Inactive_12_mon 1627.0 NaN NaN NaN 2.693301 0.899623 0.0 2.0 3.0 3.0 6.0
Contacts_Count_12_mon 1627.0 NaN NaN NaN 2.972342 1.090537 0.0 2.0 3.0 4.0 6.0
Credit_Limit 1627.0 NaN NaN NaN 8136.039459 9095.334105 1438.3 2114.0 4178.0 9933.5 34516.0
Total_Revolving_Bal 1627.0 NaN NaN NaN 672.822987 921.385582 0.0 0.0 0.0 1303.5 2517.0
Avg_Open_To_Buy 1627.0 NaN NaN NaN 7463.216472 9109.208129 3.0 1587.0 3488.0 9257.5 34516.0
Total_Amt_Chng_Q4_Q1 1627.0 NaN NaN NaN 0.694277 0.214924 0.0 0.5445 0.701 0.856 1.492
Total_Trans_Amt 1627.0 NaN NaN NaN 3095.025814 2308.227629 510.0 1903.5 2329.0 2772.0 10583.0
Total_Trans_Ct 1627.0 NaN NaN NaN 44.93362 14.568429 10.0 37.0 43.0 51.0 94.0
Total_Ct_Chng_Q4_Q1 1627.0 NaN NaN NaN 0.554386 0.226854 0.0 0.4 0.531 0.692 2.5
Avg_Utilization_Ratio 1627.0 NaN NaN NaN 0.162475 0.264458 0.0 0.0 0.0 0.231 0.999
  • Most attrited customers are 52 or younger, female, graduates, married, Blue cardholders, and earn less than $40K annually. To stem attrition, the bank should focus on engaging customers fitting this profile — ideally once a month — to on-board them onto more of the bank's products and ensure their satisfaction with its service.

Splitting the dataset into train, validation, and test sets

In [58]:
x = df.drop("Attrition_Flag", axis=1)  # defining the independent features
x = pd.get_dummies(x)
df["Attrition_Flag"] = df["Attrition_Flag"].replace(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
y = df["Attrition_Flag"]  # defining the target feature
In [59]:
x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1, stratify=y
)  # first split: hold out 25% of the data as the test set
In [60]:
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)  # second split: carve 25% of the remaining data off as the validation set
print(x_train.shape, x_val.shape, x_test.shape)
(5696, 35) (1899, 35) (2532, 35)
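Because the 25% test split is taken first and the 25% validation split is taken from the remainder, the final proportions are roughly 56% / 19% / 25%. The shapes printed above can be reproduced arithmetically (sklearn rounds the held-out share up):

```python
import math

n = 10127                       # rows in the full dataset
n_test = math.ceil(n * 0.25)    # first split: 25% held out for test
n_temp = n - n_test
n_val = math.ceil(n_temp * 0.25)  # second split: 25% of the remainder
n_train = n_temp - n_val
print(n_train, n_val, n_test)   # → 5696 1899 2532
```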

Missing values treatment

In [61]:
# Let's impute missing values using KNNImputer
imputer = KNNImputer(n_neighbors=5)
x_train = pd.DataFrame(
    imputer.fit_transform(x_train), columns=x_train.columns
)  # imputing missing values in the train set
x_val = pd.DataFrame(
    imputer.transform(x_val), columns=x_val.columns
)  # imputing missing values in the validation set
x_test = pd.DataFrame(
    imputer.transform(x_test), columns=x_test.columns
)  # imputing missing values in the test set
In [62]:
# Let's check whether any missingness remains in our data sets
print(x_train.isnull().sum())
print('*'*50)
print(x_val.isnull().sum())
print('*'*50)
print(x_test.isnull().sum())
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_F                          0
Gender_M                          0
Education_Level_College           0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Divorced           0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$120K +           0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Income_Category_Unknown           0
Card_Category_Blue                0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64
**************************************************
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_F                          0
Gender_M                          0
Education_Level_College           0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Divorced           0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$120K +           0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Income_Category_Unknown           0
Card_Category_Blue                0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64
**************************************************
Customer_Age                      0
Dependent_count                   0
Months_on_book                    0
Total_Relationship_Count          0
Months_Inactive_12_mon            0
Contacts_Count_12_mon             0
Credit_Limit                      0
Total_Revolving_Bal               0
Avg_Open_To_Buy                   0
Total_Amt_Chng_Q4_Q1              0
Total_Trans_Amt                   0
Total_Trans_Ct                    0
Total_Ct_Chng_Q4_Q1               0
Avg_Utilization_Ratio             0
Gender_F                          0
Gender_M                          0
Education_Level_College           0
Education_Level_Doctorate         0
Education_Level_Graduate          0
Education_Level_High School       0
Education_Level_Post-Graduate     0
Education_Level_Uneducated        0
Marital_Status_Divorced           0
Marital_Status_Married            0
Marital_Status_Single             0
Income_Category_$120K +           0
Income_Category_$40K - $60K       0
Income_Category_$60K - $80K       0
Income_Category_$80K - $120K      0
Income_Category_Less than $40K    0
Income_Category_Unknown           0
Card_Category_Blue                0
Card_Category_Gold                0
Card_Category_Platinum            0
Card_Category_Silver              0
dtype: int64
  • There is no more missingness in our data sets

Skewness and Outliers treatment

We will apply an arcsinh (log-like) transformation to some features that are highly skewed and have outliers; unlike a plain log, arcsinh is defined at zero.

In [63]:
col_to_transform = (
    "Customer_Age",
    "Months_on_book",
    "Credit_Limit",
    "Total_Revolving_Bal",
    "Avg_Open_To_Buy",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Trans_Amt",
    "Total_Trans_Ct",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
)  # Defining the skewed features to which we want to apply the arcsinh transformation

# defining a function to apply the arcsinh transformation in place to all the features defined above
def trans_col(data):
    for col in col_to_transform:
        data[col] = np.arcsinh(data[col])
    return data[col]  # returns the last transformed column as a quick visual check
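Note that `trans_col` actually applies `np.arcsinh` rather than `np.log`. The inverse hyperbolic sine behaves like log(2x) for large values but, unlike log, is defined at zero — convenient here since features such as Total_Revolving_Bal and Avg_Utilization_Ratio contain zeros. A quick check:

```python
import numpy as np

x = np.array([0.0, 1.0, 1000.0])
print(np.arcsinh(x))  # finite everywhere, including at zero

# at zero it is exactly zero, where np.log would be -inf
assert np.arcsinh(0.0) == 0.0

# for large x, arcsinh(x) = ln(x + sqrt(x**2 + 1)) ≈ ln(2x)
assert np.isclose(np.arcsinh(1000.0), np.log(2 * 1000.0), atol=1e-6)
```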
In [64]:
trans_col(x_train)  # Applying the function on the train set
Out[64]:
0       0.671407
1       0.104808
2       0.599449
3       0.372337
4       0.000000
          ...   
5691    0.646100
5692    0.117728
5693    0.027996
5694    0.216309
5695    0.126661
Name: Avg_Utilization_Ratio, Length: 5696, dtype: float64
In [65]:
trans_col(x_val)  # Applying the function on the validation set
Out[65]:
0       0.038990
1       0.000000
2       0.708278
3       0.088883
4       0.000000
          ...   
1894    0.406696
1895    0.798425
1896    0.000000
1897    0.046983
1898    0.365788
Name: Avg_Utilization_Ratio, Length: 1899, dtype: float64
In [66]:
trans_col(x_test)  # Applying the function on the test set
Out[66]:
0       0.612900
1       0.000000
2       0.000000
3       0.000000
4       0.648566
          ...   
2527    0.000000
2528    0.000000
2529    0.060962
2530    0.141527
2531    0.701924
Name: Avg_Utilization_Ratio, Length: 2532, dtype: float64
In [67]:
x_train.head()  # Checking the first 5 rows of the train set to confirm the transformation was applied as expected
Out[67]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_F Gender_M Education_Level_College Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Marital_Status_Divorced Marital_Status_Married Marital_Status_Single Income_Category_$120K + Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_Unknown Card_Category_Blue Card_Category_Gold Card_Category_Platinum Card_Category_Silver
0 4.625069 1.0 4.499933 5.0 3.0 1.0 8.269757 7.945201 6.986567 0.604505 8.497807 4.477466 0.594379 0.671407 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1 4.787561 2.0 4.564457 3.0 4.0 1.0 9.988380 7.736307 9.877246 0.692347 8.850231 4.882859 0.512296 0.104808 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 4.663528 3.0 4.454482 5.0 1.0 2.0 8.469263 8.016978 7.458186 0.776562 9.193601 5.049897 0.845573 0.599449 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
3 4.718579 1.0 4.477466 5.0 1.0 4.0 9.008958 8.043663 8.529517 1.046343 8.411833 4.356873 0.776562 0.372337 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 4.644483 2.0 4.330906 6.0 3.0 3.0 11.142325 0.000000 11.142325 0.601978 8.395252 4.543408 0.373271 0.000000 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
  • The table above shows the transformation was applied to all the selected features as expected
In [68]:
hist_box_plt(
    x_train.Credit_Limit
)  # To plot the boxplot and histplot of the Credit_Limit in train set to see how the transformation done has helped to deal with skewness and outliers earlier observed
  • The arcsinh transformation has noticeably reduced the skewness and outliers in the distribution.

Model Building

In [69]:
print("Shape of Training set : ", x_train.shape)
print("Shape of Validation set : ", x_val.shape)
print("Shape of test set : ", x_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (5696, 35)
Shape of Validation set :  (1899, 35)
Shape of test set :  (2532, 35)
Percentage of classes in training set:
0    0.839361
1    0.160639
Name: Attrition_Flag, dtype: float64
Percentage of classes in validation set:
0    0.839389
1    0.160611
Name: Attrition_Flag, dtype: float64
Percentage of classes in test set:
0    0.839258
1    0.160742
Name: Attrition_Flag, dtype: float64

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a customer will attrite when in reality they will not (a false positive).
  2. Predicting that a customer will not attrite when in reality they will (a false negative).

Which case is more important?

  • Both cases are costly, but not equally so.
  • The bank is primarily interested in identifying the credit card customers who are likely to attrite, so that it can improve its services and retain them. Missing a true attriter (a false negative) is therefore the more expensive error.

How to reduce this loss of customers?

  • 'Recall' should be maximized: the greater the recall score, the more of the customers who are truly likely to attrite are correctly identified (i.e., the fewer false negatives).
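As a concrete reminder of what maximising recall means, recall = TP / (TP + FN), where FN are the attriters the model misses. A quick toy check using the same sklearn metrics the notebook imports:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 true attriters
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]  # 3 of them caught, 1 missed

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))  # same value from sklearn
```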

First, let's create functions to calculate different metrics and confusion matrix for each model.

  • The model_performance function will compute performance metrics for each model.
  • The conf_mat function will plot the confusion matrix.
In [70]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance(model, predictors, target):

    # predicting class labels; np.round is a no-op for sklearn classifiers
    # but lets the same function handle models that output probabilities
    pred_prob = model.predict(predictors)
    pred = np.round(pred_prob)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
In [71]:
# defining a function to plot the confusion_matrix of a classification model built
def conf_mat(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("Actual Class")
    plt.xlabel("Predicted Class")

Logistic Regression

In [72]:
# Let's fit the Logistic regression model
lr = LogisticRegression(solver="newton-cg", random_state=1)
lr.fit(x_train, y_train)
Out[72]:
LogisticRegression(random_state=1, solver='newton-cg')
In [73]:
# checking model performance on the train data
lr_train_perf = model_performance(lr, x_train, y_train)

print("Training performance:")
lr_train_perf
Training performance:
Out[73]:
Accuracy Recall Precision F1
0 0.90467 0.559563 0.785276 0.653478
In [74]:
# checking model performance on the validation data
lr_val_perf = model_performance(lr, x_val, y_val)

print("Validation performance:")
lr_val_perf
Validation performance:
Out[74]:
Accuracy Recall Precision F1
0 0.902054 0.583607 0.751055 0.656827
In [75]:
# creating confusion matrix of the validation set
conf_mat(lr, x_val, y_val)
  • The logistic regression model generalises well (similar train and validation scores), but its validation recall of 0.58 is low.
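One common lever for raising recall on an imbalanced target like this one (~16% attriters) is class weighting. A sketch on synthetic data — not the notebook's dataset or results — showing `class_weight="balanced"` with the same LogisticRegression API:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data standing in for the bank's features
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.84, 0.16], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

plain = LogisticRegression(max_iter=1000, random_state=1).fit(X_tr, y_tr)
balanced = LogisticRegression(
    max_iter=1000, class_weight="balanced", random_state=1
).fit(X_tr, y_tr)

print("recall (default):  ", recall_score(y_te, plain.predict(X_te)))
print("recall (balanced): ", recall_score(y_te, balanced.predict(X_te)))
```

Balancing reweights the minority (attriter) class during fitting, which typically raises recall at some cost to precision — a trade-off that matches the evaluation criterion stated above.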

Decision Tree

In [76]:
# Fitting the decision tree model
dt = DecisionTreeClassifier(random_state=1)
dt.fit(x_train, y_train)
Out[76]:
DecisionTreeClassifier(random_state=1)
In [77]:
# checking model performance on the train data
dt_train_perf = model_performance(dt, x_train, y_train)

print("Training performance:")
dt_train_perf
Training performance:
Out[77]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [78]:
# checking model performance on the validation data
dt_val_perf = model_performance(dt, x_val, y_val)

print("Validation performance:")
dt_val_perf
Validation performance:
Out[78]:
Accuracy Recall Precision F1
0 0.938389 0.783607 0.824138 0.803361
In [79]:
# creating confusion matrix of the validation set
conf_mat(dt, x_val, y_val)
  • The decision tree overfits the training set (perfect scores) but still achieves a better validation recall of 0.78.

Random Forest Model

In [80]:
# Fitting Random forest model
rf = RandomForestClassifier(random_state=1)
rf.fit(x_train, y_train)
Out[80]:
RandomForestClassifier(random_state=1)
In [81]:
# checking model performance on the train data
rf_train_perf = model_performance(rf, x_train, y_train)

print("Training performance:")
rf_train_perf
Training performance:
Out[81]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [82]:
# checking model performance on the validation data
rf_val_perf = model_performance(rf, x_val, y_val)

print("Validation performance:")
rf_val_perf
Validation performance:
Out[82]:
Accuracy Recall Precision F1
0 0.945234 0.731148 0.910204 0.810909
In [83]:
# creating confusion matrix of the validation set
conf_mat(rf, x_val, y_val)
  • The random forest model also overfits but with good recall of 0.73 on validation set.

Bagging Classifier

In [84]:
# Fitting the bagging classifier model
bg = BaggingClassifier(random_state=1)
bg.fit(x_train, y_train)
Out[84]:
BaggingClassifier(random_state=1)
In [85]:
# checking model performance on the train data
bg_train_perf = model_performance(bg, x_train, y_train)

print("Training performance:")
bg_train_perf
Training performance:
Out[85]:
Accuracy Recall Precision F1
0 0.996664 0.982514 0.996674 0.989543
In [86]:
# checking model performance on the validation data
bg_val_perf = model_performance(bg, x_val, y_val)

print("Validation performance:")
bg_val_perf
Validation performance:
Out[86]:
Accuracy Recall Precision F1
0 0.955766 0.816393 0.898917 0.85567
In [87]:
# creating confusion matrix of the validation set
conf_mat(bg, x_val, y_val)
  • The bagging classifier also overfits slightly but achieves the best validation recall so far, 0.82.

Adaboost Classifier Model

In [88]:
# Fitting AdaboostClassifier model
abc = AdaBoostClassifier(random_state=1)
abc.fit(x_train, y_train)
Out[88]:
AdaBoostClassifier(random_state=1)
In [89]:
# checking model performance on the train data
abc_train_perf = model_performance(abc, x_train, y_train)

print("Training performance:")
abc_train_perf
Training performance:
Out[89]:
Accuracy Recall Precision F1
0 0.965414 0.865574 0.91455 0.889388
In [90]:
# checking model performance on the validation data
abc_val_perf = model_performance(abc, x_val, y_val)

print("Validation performance:")
abc_val_perf
Validation performance:
Out[90]:
Accuracy Recall Precision F1
0 0.945761 0.793443 0.858156 0.824532
In [91]:
# creating confusion matrix of the validation set
conf_mat(abc, x_val, y_val)
  • The AdaBoost classifier model overfits slightly but performs well on the validation set, with a recall of 0.79.

XGBoost Classifier

In [92]:
# Fitting XGBoost Classifier model on train set
xgc = XGBClassifier(random_state=1, eval_metric="logloss")
xgc.fit(x_train, y_train)
Out[92]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [93]:
# checking model performance on the train data
xgc_train_perf = model_performance(xgc, x_train, y_train)

print("Training performance:")
xgc_train_perf
Training performance:
Out[93]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [94]:
# checking model performance on the validation data
xgc_val_perf = model_performance(xgc, x_val, y_val)

print("Validation performance:")
xgc_val_perf
Validation performance:
Out[94]:
Accuracy Recall Precision F1
0 0.962612 0.859016 0.903448 0.880672
In [95]:
# creating confusion matrix of the validation set
conf_mat(xgc, x_val, y_val)
  • The XGBClassifier model shows improved performance on the validation set, with a recall of 0.86.

Oversampling Techniques

In [96]:
# Checking the train data shape before and after oversampling
print("Before UpSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
x_train_over, y_train_over = sm.fit_resample(
    x_train, y_train
)  # oversampling the train set


print("After UpSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))


print("After UpSampling, the shape of train_X: {}".format(x_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label '1': 915
Before UpSampling, counts of label '0': 4781 

After UpSampling, counts of label '1': 4781
After UpSampling, counts of label '0': 4781 

After UpSampling, the shape of train_X: (9562, 35)
After UpSampling, the shape of train_y: (9562,) 
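SMOTE balances the classes by synthesising new minority samples rather than duplicating existing ones: each synthetic point is a random interpolation between a minority sample and one of its k nearest minority neighbours. A minimal sketch of that core step follows (an illustration of the principle on toy data, not the imbalanced-learn implementation; `smote_point` is a hypothetical helper):

```python
# Illustration of SMOTE's core step (not the imbalanced-learn
# implementation): a synthetic minority point is a random interpolation
# between a minority sample and one of its nearest minority neighbours.
import numpy as np

rng = np.random.default_rng(1)

# three minority-class points in a 2-D feature space (toy data)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])


def smote_point(X, i, j, rng):
    """Interpolate between sample i and neighbour j at a random fraction."""
    gap = rng.random()  # fraction in [0, 1)
    return X[i] + gap * (X[j] - X[i])


synthetic = smote_point(minority, 0, 1, rng)
print(synthetic)  # a new point on the segment between the two samples
```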

Logistic Regression on Oversampled Data

In [97]:
# Training the basic logistic regression model with oversampled training set
lr_over = LogisticRegression(solver="newton-cg", random_state=1)
lr_over.fit(x_train_over, y_train_over)
Out[97]:
LogisticRegression(random_state=1, solver='newton-cg')
In [98]:
# checking model performance on the oversampled train data
lr_over_train_perf = model_performance(lr_over, x_train_over, y_train_over)

print("Training performance:")
lr_over_train_perf
Training performance:
Out[98]:
Accuracy Recall Precision F1
0 0.854947 0.851914 0.857113 0.854505
In [99]:
# checking model performance on the validation data
lr_over_val_perf = model_performance(lr_over, x_val, y_val)

print("Validation performance:")
lr_over_val_perf
Validation performance:
Out[99]:
Accuracy Recall Precision F1
0 0.85466 0.832787 0.530271 0.647959
In [100]:
# creating confusion matrix of the validation set
conf_mat(lr_over, x_val, y_val)
  • The logistic regression model trained on the oversampled data generalises well and achieves a good recall of 0.83 on the validation set.

Decision Tree Model on Oversampled Data

In [101]:
# fitting the decision tree model on oversampled train data
dt_over = DecisionTreeClassifier(random_state=1)
dt_over.fit(x_train_over, y_train_over)
Out[101]:
DecisionTreeClassifier(random_state=1)
In [102]:
# checking model performance on the oversampled train data
dt_over_train_perf = model_performance(dt_over, x_train_over, y_train_over)

print("Training performance:")
dt_over_train_perf
Training performance:
Out[102]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [103]:
# checking model performance on the validation data
dt_over_val_perf = model_performance(dt_over, x_val, y_val)

print("Validation performance:")
dt_over_val_perf
Validation performance:
Out[103]:
Accuracy Recall Precision F1
0 0.926804 0.819672 0.748503 0.782473
In [104]:
# creating confusion matrix of the validation set
conf_mat(dt_over, x_val, y_val)
  • The dt_over model still overfits, with a recall of 0.82 on the validation set. However, the confusion matrix still shows 55 misclassified Attrited Customers.

Random Forest Classifier on Oversampled Data

In [105]:
# Fitting the Random Forest Classifier on oversampled data
rf_over = RandomForestClassifier(random_state=1)
rf_over.fit(x_train_over, y_train_over)
Out[105]:
RandomForestClassifier(random_state=1)
In [106]:
# checking model performance on the oversampled train data
rf_over_train_perf = model_performance(rf_over, x_train_over, y_train_over)

print("Training performance:")
rf_over_train_perf
Training performance:
Out[106]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [107]:
# checking model performance on the validation data
rf_over_val_perf = model_performance(rf_over, x_val, y_val)

print("Validation performance:")
rf_over_val_perf
Validation performance:
Out[107]:
Accuracy Recall Precision F1
0 0.944708 0.813115 0.837838 0.825291
In [108]:
# creating confusion matrix of the validation set
conf_mat(rf_over, x_val, y_val)
  • The rf_over model also overfits and performs similarly to the dt_over model on the validation set, with a recall of 0.81. However, the confusion matrix shows 57 misclassified Attrited Customers.

Bagging Classifier on Oversampled Data

In [109]:
# Fitting the bagging classifier on oversampled data
bg_over = BaggingClassifier(random_state=1)
bg_over.fit(x_train_over, y_train_over)
Out[109]:
BaggingClassifier(random_state=1)
In [110]:
# checking model performance on the oversampled train data
bg_over_train_perf = model_performance(bg_over, x_train_over, y_train_over)

print("Training performance:")
bg_over_train_perf
Training performance:
Out[110]:
Accuracy Recall Precision F1
0 0.998431 0.99749 0.999371 0.99843
In [111]:
# checking model performance on the validation data
bg_over_val_perf = model_performance(bg_over, x_val, y_val)

print("Validation performance:")
bg_over_val_perf
Validation performance:
Out[111]:
Accuracy Recall Precision F1
0 0.938389 0.816393 0.803226 0.809756
In [112]:
# creating confusion matrix of the validation set
conf_mat(bg_over, x_val, y_val)
  • The bg_over model shows a recall of 0.82 on the validation set, similar to the dt_over model.

AdaBoost Classifier on Oversampled Data

In [113]:
# Fitting the AdaBoost Classifier on oversampled data
abc_over = AdaBoostClassifier(random_state=1)
abc_over.fit(x_train_over, y_train_over)
Out[113]:
AdaBoostClassifier(random_state=1)
In [114]:
# checking model performance on the oversampled train data
abc_over_train_perf = model_performance(abc_over, x_train_over, y_train_over)

print("Training performance:")
abc_over_train_perf
Training performance:
Out[114]:
Accuracy Recall Precision F1
0 0.962665 0.968207 0.957592 0.962871
In [115]:
# checking model performance on the validation data
abc_over_val_perf = model_performance(abc_over, x_val, y_val)

print("Validation performance:")
abc_over_val_perf
Validation performance:
Out[115]:
Accuracy Recall Precision F1
0 0.934702 0.868852 0.759312 0.810398
In [116]:
# creating confusion matrix of the validation set
conf_mat(abc_over, x_val, y_val)
  • The abc_over model achieves a better recall of 0.87 on the validation set and reduces the misclassified Attrited Customers to 40.

XGBoost Classifier on Oversampled Data

In [117]:
# Fitting the XGBoost Classifier on oversampled data
xgc_over = XGBClassifier(eval_metric="logloss", random_state=1)
xgc_over.fit(x_train_over, y_train_over)
Out[117]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [118]:
# checking model performance on the oversampled train data
xgc_over_train_perf = model_performance(xgc_over, x_train_over, y_train_over)

print("Training performance:")
xgc_over_train_perf
Training performance:
Out[118]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [119]:
# checking model performance on the validation data
xgc_over_val_perf = model_performance(xgc_over, x_val, y_val)

print("Validation performance:")
xgc_over_val_perf
Validation performance:
Out[119]:
Accuracy Recall Precision F1
0 0.965245 0.895082 0.889251 0.892157
In [120]:
# creating confusion matrix of the validation set
conf_mat(xgc_over, x_val, y_val)
  • The xgc_over model also shows improved performance on the validation set, with a recall of 0.90, and further reduces the misclassified Attrited Customers to 32.
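As an aside, XGBoost also offers a resampling-free way to handle class imbalance: the `scale_pos_weight` parameter (visible in the printed estimator above, left at its default of 1), for which the usual heuristic is the ratio of negative to positive training samples. A hedged sketch using the train-set label counts reported in the oversampling cell; `xgc_w` is a hypothetical name, not a model fitted in this notebook:

```python
# Resampling-free alternative for XGBoost: weight the positive class via
# scale_pos_weight, commonly set to n_negative / n_positive.
# Counts below are the train-set label counts reported earlier.
n_pos, n_neg = 915, 4781
ratio = n_neg / n_pos
print(round(ratio, 2))  # 5.23

# hypothetical usage (not run here):
# xgc_w = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss", random_state=1)
```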

Undersampling Techniques

In [121]:
# Undersampling the train set
rus = RandomUnderSampler(random_state=1)
x_train_un, y_train_un = rus.fit_resample(x_train, y_train)
In [122]:
# checking the shape of train set before and after undersampling was done
print("Before DownSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before DownSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

print("After DownSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After DownSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))


print("After DownSampling, the shape of train_X: {}".format(x_train_un.shape))
print("After DownSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before DownSampling, counts of label '1': 915
Before DownSampling, counts of label '0': 4781 

After DownSampling, counts of label '1': 915
After DownSampling, counts of label '0': 915 

After DownSampling, the shape of train_X: (1830, 35)
After DownSampling, the shape of train_y: (1830,) 
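RandomUnderSampler balances the classes by keeping every minority sample and drawing a random majority-class subset of the same size. The idea can be sketched as follows (an illustration of the principle on toy labels, not the imbalanced-learn implementation):

```python
# Sketch of random undersampling: keep all minority samples and draw an
# equally sized random subset of the majority class.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 10 + [1] * 3)  # toy labels: 10 majority, 3 minority

minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0), size=minority_idx.size, replace=False)
balanced_idx = np.concatenate([majority_idx, minority_idx])

print(np.bincount(y[balanced_idx]))  # [3 3]
```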

Logistic Regression on Undersampled Data

In [123]:
# Training the basic logistic regression model with undersampled training set
lr_un = LogisticRegression(solver="newton-cg", random_state=1)
lr_un.fit(x_train_un, y_train_un)
Out[123]:
LogisticRegression(random_state=1, solver='newton-cg')
In [124]:
# checking model performance on the undersampled train data
lr_un_train_perf = model_performance(lr_un, x_train_un, y_train_un)

print("Training performance:")
lr_un_train_perf
Training performance:
Out[124]:
Accuracy Recall Precision F1
0 0.840984 0.831694 0.847439 0.839493
In [125]:
# checking model performance on the validation data
lr_un_val_perf = model_performance(lr_un, x_val, y_val)

print("Validation performance:")
lr_un_val_perf
Validation performance:
Out[125]:
Accuracy Recall Precision F1
0 0.847288 0.852459 0.514851 0.641975
In [126]:
# creating confusion matrix of the validation set
conf_mat(lr_un, x_val, y_val)
  • The logistic regression model trained on the undersampled data generalises well and achieves a good recall of 0.85 on the validation set. However, its precision of 0.51 is poor, meaning many existing customers are misclassified as attrited.

Decision Tree Model on Undersampled Data

In [127]:
# fitting the decision tree model on undersampled train data
dt_un = DecisionTreeClassifier(random_state=1)
dt_un.fit(x_train_un, y_train_un)
Out[127]:
DecisionTreeClassifier(random_state=1)
In [128]:
# checking model performance on the undersampled train data
dt_un_train_perf = model_performance(dt_un, x_train_un, y_train_un)

print("Training performance:")
dt_un_train_perf
Training performance:
Out[128]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [129]:
# checking model performance on the validation data
dt_un_val_perf = model_performance(dt_un, x_val, y_val)

print("Validation performance:")
dt_un_val_perf
Validation performance:
Out[129]:
Accuracy Recall Precision F1
0 0.901527 0.881967 0.640476 0.742069
In [130]:
# creating confusion matrix of the validation set
conf_mat(dt_un, x_val, y_val)
  • The dt_un model still overfits, with a recall of 0.88 on the validation set, which is better than the decision tree model trained on oversampled data.

Random Forest Classifier on Undersampled Data

In [131]:
# Fitting the Random Forest Classifier on Undersampled data
rf_un = RandomForestClassifier(random_state=1)
rf_un.fit(x_train_un, y_train_un)
Out[131]:
RandomForestClassifier(random_state=1)
In [132]:
# checking model performance on the undersampled train data
rf_un_train_perf = model_performance(rf_un, x_train_un, y_train_un)

print("Training performance:")
rf_un_train_perf
Training performance:
Out[132]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [133]:
# checking model performance on the validation data
rf_un_val_perf = model_performance(rf_un, x_val, y_val)

print("Validation performance:")
rf_un_val_perf
Validation performance:
Out[133]:
Accuracy Recall Precision F1
0 0.923644 0.914754 0.701005 0.793741
In [134]:
# creating confusion matrix of the validation set
conf_mat(rf_un, x_val, y_val)
  • The rf_un model also overfits but performs well on the validation set, with a recall of 0.91. However, the confusion matrix shows 26 misclassified Attrited Customers.

Bagging Classifier on Undersampled Data

In [135]:
# Fitting the bagging classifier on undersampled data
bg_un = BaggingClassifier(random_state=1)
bg_un.fit(x_train_un, y_train_un)
Out[135]:
BaggingClassifier(random_state=1)
In [136]:
# checking model performance on the undersampled train data
bg_un_train_perf = model_performance(bg_un, x_train_un, y_train_un)

print("Training performance:")
bg_un_train_perf
Training performance:
Out[136]:
Accuracy Recall Precision F1
0 0.994536 0.990164 0.998897 0.994512
In [137]:
# checking model performance on the validation data
bg_un_val_perf = model_performance(bg_un, x_val, y_val)

print("Validation performance:")
bg_un_val_perf
Validation performance:
Out[137]:
Accuracy Recall Precision F1
0 0.92575 0.914754 0.708122 0.798283
In [138]:
# creating confusion matrix of the validation set
conf_mat(bg_un, x_val, y_val)
  • The bg_un model shows a recall of 0.91 on the validation set, similar to the rf_un model, but with a slightly better precision of 0.71.

AdaBoost Classifier on Undersampled Data

In [139]:
# Fitting the AdaBoost Classifier on undersampled data
abc_un = AdaBoostClassifier(random_state=1)
abc_un.fit(x_train_un, y_train_un)
Out[139]:
AdaBoostClassifier(random_state=1)
In [140]:
# checking model performance on the undersampled train data
abc_un_train_perf = model_performance(abc_un, x_train_un, y_train_un)

print("Training performance:")
abc_un_train_perf
Training performance:
Out[140]:
Accuracy Recall Precision F1
0 0.950273 0.95082 0.949782 0.9503
In [141]:
# checking model performance on the validation data
abc_un_val_perf = model_performance(abc_un, x_val, y_val)

print("Validation performance:")
abc_un_val_perf
Validation performance:
Out[141]:
Accuracy Recall Precision F1
0 0.926277 0.92459 0.706767 0.801136
In [142]:
# creating confusion matrix of the validation set
conf_mat(abc_un, x_val, y_val)
  • The abc_un model achieves a better recall of 0.92 on the validation set and reduces the misclassified Attrited Customers to 23.

XGBoost Classifier on Undersampled Data

In [143]:
# Fitting the XGBoost Classifier on undersampled data
xgc_un = XGBClassifier(eval_metric="logloss", random_state=1)
xgc_un.fit(x_train_un, y_train_un)
Out[143]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [144]:
# checking model performance on the undersampled train data
xgc_un_train_perf = model_performance(xgc_un, x_train_un, y_train_un)

print("Training performance:")
xgc_un_train_perf
Training performance:
Out[144]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [145]:
# checking model performance on the validation data
xgc_un_val_perf = model_performance(xgc_un, x_val, y_val)

print("Validation performance:")
xgc_un_val_perf
Validation performance:
Out[145]:
Accuracy Recall Precision F1
0 0.947341 0.947541 0.774799 0.852507
In [146]:
# creating confusion matrix of the validation set
conf_mat(xgc_un, x_val, y_val)
  • The xgc_un model also shows improved performance on the validation set, with a recall of 0.95 and a precision of 0.77, and further reduces the misclassified Attrited Customers to 16.

Tuning the Models with Good Recall Scores

In [147]:
# Tuning the BaggingClassifier with undersampled data
bg_un_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_samples": [0.7, 0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "n_estimators": [10, 20, 30, 40, 50],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(bg_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Set the classifier to the best combination of parameters
bg_un_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bg_un_tuned.fit(x_train_un, y_train_un)
Out[147]:
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=30,
                  random_state=1)
In [148]:
# checking model performance on the undersampled train data
bg_un_tuned_train_perf = model_performance(bg_un_tuned, x_train_un, y_train_un)

print("Training performance:")
bg_un_tuned_train_perf
Training performance:
Out[148]:
Accuracy Recall Precision F1
0 0.999454 1.0 0.998908 0.999454
In [149]:
# checking model performance on the validation data
bg_un_tuned_val_perf = model_performance(bg_un_tuned, x_val, y_val)

print("Validation performance:")
bg_un_tuned_val_perf
Validation performance:
Out[149]:
Accuracy Recall Precision F1
0 0.931016 0.931148 0.720812 0.812589
In [150]:
# creating confusion matrix of the validation set
conf_mat(bg_un_tuned, x_val, y_val)
  • The tuned model shows an improved recall of 0.93 compared to the earlier bagging classifier with default parameters on the undersampled train data.
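Note that the `metrics.make_scorer(metrics.recall_score)` scorer used in these grid searches is equivalent to passing scikit-learn's built-in scoring string `"recall"`. A small sketch on toy data (the data and variable names are illustrative, not from this notebook):

```python
# make_scorer(recall_score) and scoring="recall" select the same models.
import numpy as np
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

params = {"max_depth": [1, 2]}
grid_fn = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    params,
    scoring=metrics.make_scorer(metrics.recall_score),
    cv=3,
).fit(X, y)
grid_str = GridSearchCV(
    DecisionTreeClassifier(random_state=1), params, scoring="recall", cv=3
).fit(X, y)

print(grid_fn.best_score_ == grid_str.best_score_)  # True
```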
In [151]:
# Tuning the AdaboostClassifier on Undersampled data
abc_un_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for base_estimator
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": np.arange(0.1, 2, 0.1),
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Running the grid search
grid_obj = GridSearchCV(abc_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Setting the classifier to the best combination of parameters
abc_un_tuned = grid_obj.best_estimator_

# Fitting the best algorithm to the data.
abc_un_tuned.fit(x_train_un, y_train_un)
Out[151]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=0.6, n_estimators=100, random_state=1)
In [152]:
# checking model performance on the undersampled train data
abc_un_tuned_train_perf = model_performance(abc_un_tuned, x_train_un, y_train_un)

print("Training performance:")
abc_un_tuned_train_perf
Training performance:
Out[152]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [153]:
# checking model performance on the validation data
abc_un_tuned_val_perf = model_performance(abc_un_tuned, x_val, y_val)

print("Validation performance:")
abc_un_tuned_val_perf
Validation performance:
Out[153]:
Accuracy Recall Precision F1
0 0.942601 0.940984 0.759259 0.84041
In [154]:
# creating confusion matrix of the validation set
conf_mat(abc_un_tuned, x_val, y_val)
  • The tuned AdaBoost classifier performs better on the validation set, with a recall of 0.94.
In [155]:
# Tuning the XGBoost classifier on undersampled data
xgc_un_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters to choose from
parameters = {
    "n_estimators": [75, 100, 125, 150],
    "subsample": [0.7, 0.8, 0.9, 1],
    "gamma": [0, 1, 3, 5],
    "colsample_bytree": [0.7, 0.8, 0.9, 1],
    "colsample_bylevel": [0.7, 0.8, 0.9, 1],
}

# Using recall score to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Running the grid search
grid_obj = GridSearchCV(xgc_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Setting the classifier to the best combination of parameters
xgc_un_tuned = grid_obj.best_estimator_

# Fitting the best algorithm to the data.
xgc_un_tuned.fit(x_train_un, y_train_un)
Out[155]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
              colsample_bynode=1, colsample_bytree=0.8,
              enable_categorical=False, eval_metric='logloss', gamma=3,
              gpu_id=-1, importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=150, n_jobs=4, num_parallel_tree=1, predictor='auto',
              random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)
In [156]:
# checking model performance on the undersampled train data
xgc_un_tuned_train_perf = model_performance(xgc_un_tuned, x_train_un, y_train_un)

print("Training performance:")
xgc_un_tuned_train_perf
Training performance:
Out[156]:
Accuracy Recall Precision F1
0 0.994536 0.997814 0.991314 0.994553
In [157]:
# checking model performance on the validation data
xgc_un_tuned_val_perf = model_performance(xgc_un_tuned, x_val, y_val)

print("Validation performance:")
xgc_un_tuned_val_perf
Validation performance:
Out[157]:
Accuracy Recall Precision F1
0 0.947341 0.947541 0.774799 0.852507
In [158]:
# creating confusion matrix of the validation set
conf_mat(xgc_un_tuned, x_val, y_val)
  • The xgc_un_tuned model shows similar performance on the validation set to the earlier XGBoost classifier with default parameters on the undersampled data.

Tuning the Same Three Models with RandomizedSearchCV

In [159]:
# Tuning the BaggingClassifier with undersampled data using RandomizedSearchCV
bg_un_tuned_rs = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_samples": [0.5, 0.7, 0.8, 0.9, 1],
    "max_features": [0.5, 0.7, 0.8, 0.9, 1],
    "n_estimators": [10, 20, 30, 40, 50, 100],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the randomized search
grid_obj = RandomizedSearchCV(bg_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Set the classifier to the best combination of parameters
bg_un_tuned_rs = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bg_un_tuned_rs.fit(x_train_un, y_train_un)
Out[159]:
BaggingClassifier(max_features=0.7, max_samples=0.9, n_estimators=100,
                  random_state=1)
In [160]:
# checking model performance on the undersampled train data
bg_un_tuned_rs_train_perf = model_performance(bg_un_tuned_rs, x_train_un, y_train_un)

print("Training performance:")
bg_un_tuned_rs_train_perf
Training performance:
Out[160]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [161]:
# checking model performance on the validation data
bg_un_tuned_rs_val_perf = model_performance(bg_un_tuned_rs, x_val, y_val)

print("Validation performance:")
bg_un_tuned_rs_val_perf
Validation performance:
Out[161]:
Accuracy Recall Precision F1
0 0.932596 0.927869 0.727506 0.815562
In [162]:
# creating confusion matrix of the validation set
conf_mat(bg_un_tuned_rs, x_val, y_val)
  • The model shows similar performance on the validation set to the GridSearchCV-tuned bagging classifier.
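The similar scores come at much lower cost: RandomizedSearchCV evaluates only `n_iter` sampled parameter combinations (default 10) instead of the full grid. For the bagging grid above, the budgets compare as follows:

```python
# Comparing the search budgets of GridSearchCV and RandomizedSearchCV
# for the bagging parameter grid used above.
from itertools import product

parameters = {
    "max_samples": [0.5, 0.7, 0.8, 0.9, 1],
    "max_features": [0.5, 0.7, 0.8, 0.9, 1],
    "n_estimators": [10, 20, 30, 40, 50, 100],
}

full_grid = len(list(product(*parameters.values())))
cv = 5
print(full_grid * cv)  # GridSearchCV: 150 combinations x 5 folds = 750 fits
print(10 * cv)         # RandomizedSearchCV (default n_iter=10): 50 fits
```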
In [163]:
# Tuning the AdaboostClassifier on undersampled data using RandomizedSearchCV
abc_un_tuned_rs = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for base_estimator
    "base_estimator": [
        DecisionTreeClassifier(max_depth=5, random_state=1),
        DecisionTreeClassifier(max_depth=10, random_state=1),
        DecisionTreeClassifier(max_depth=15, random_state=1),
    ],
    "n_estimators": np.arange(10, 150, 10),
    "learning_rate": np.arange(0.1, 2, 0.1),
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Running the randomized search
grid_obj = RandomizedSearchCV(abc_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Setting the classifier to the best combination of parameters
abc_un_tuned_rs = grid_obj.best_estimator_

# Fitting the best algorithm to the data.
abc_un_tuned_rs.fit(x_train_un, y_train_un)
Out[163]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,
                                                         random_state=1),
                   learning_rate=0.8, n_estimators=110, random_state=1)
In [164]:
# checking model performance on the undersampled train data
abc_un_tuned_rs_train_perf = model_performance(abc_un_tuned_rs, x_train_un, y_train_un)

print("Training performance:")
abc_un_tuned_rs_train_perf
Training performance:
Out[164]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [165]:
# checking model performance on the validation data
abc_un_tuned_rs_val_perf = model_performance(abc_un_tuned_rs, x_val, y_val)

print("Validation performance:")
abc_un_tuned_rs_val_perf
Validation performance:
Out[165]:
Accuracy Recall Precision F1
0 0.941022 0.931148 0.757333 0.835294
In [166]:
# creating confusion matrix of the validation set
conf_mat(abc_un_tuned_rs, x_val, y_val)
  • The model also shows similar performance on the validation set to the GridSearchCV-tuned AdaBoost classifier.
In [167]:
# Tuning the XGBoost classifier on undersampled data using RandomizedSearchCV
xgc_un_tuned_rs = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters to choose from
parameters = {
    "n_estimators": [75, 100, 125, 150],
    "subsample": [0.7, 0.8, 0.9, 1],
    "gamma": [0, 1, 3, 5],
    "colsample_bytree": [0.7, 0.8, 0.9, 1],
    "colsample_bylevel": [0.7, 0.8, 0.9, 1],
}

# Using recall score to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Running the randomized search
grid_obj = RandomizedSearchCV(xgc_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)

# Setting the classifier to the best combination of parameters
xgc_un_tuned_rs = grid_obj.best_estimator_

# Fitting the best algorithm to the data.
xgc_un_tuned_rs.fit(x_train_un, y_train_un)
Out[167]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.7,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=1, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=125, n_jobs=4,
              num_parallel_tree=1, predictor='auto', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [168]:
# checking model performance on the undersampled train data
xgc_un_tuned_rs_train_perf = model_performance(xgc_un_tuned_rs, x_train_un, y_train_un)

print("Training performance:")
xgc_un_tuned_rs_train_perf
Training performance:
Out[168]:
Accuracy Recall Precision F1
0 0.999454 1.0 0.998908 0.999454
In [169]:
# checking model performance on the validation data
xgc_un_tuned_rs_val_perf = model_performance(xgc_un_tuned_rs, x_val, y_val)

print("Validation performance:")
xgc_un_tuned_rs_val_perf
Validation performance:
Out[169]:
Accuracy Recall Precision F1
0 0.949447 0.944262 0.784741 0.857143
In [170]:
# creating confusion matrix of the validation set
conf_mat(xgc_un_tuned_rs, x_val, y_val)
  • The model shows slightly lower performance on the validation set than the GridSearchCV-tuned XGBClassifier.

Model Performance Summary

In [171]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        lr_train_perf.T,
        dt_train_perf.T,
        rf_train_perf.T,
        bg_train_perf.T,
        abc_train_perf.T,
        xgc_train_perf.T,
        lr_over_train_perf.T,
        dt_over_train_perf.T,
        rf_over_train_perf.T,
        bg_over_train_perf.T,
        abc_over_train_perf.T,
        xgc_over_train_perf.T,
        lr_un_train_perf.T,
        dt_un_train_perf.T,
        rf_un_train_perf.T,
        bg_un_train_perf.T,
        abc_un_train_perf.T,
        xgc_un_train_perf.T,
        bg_un_tuned_train_perf.T,
        abc_un_tuned_train_perf.T,
        xgc_un_tuned_train_perf.T,
        bg_un_tuned_rs_train_perf.T,
        abc_un_tuned_rs_train_perf.T,
        xgc_un_tuned_rs_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression",
    "Decision Tree",
    "Random Forest",
    "Bagging",
    "AdaBoost",
    "XGBoost",
    "Logistic Regression-OverSampling",
    "Decision Tree-OverSampling",
    "Random Forest-OverSampling",
    "Bagging-OverSampling",
    "AdaBoost-OverSampling",
    "XGBoost-OverSampling",
    "Logistic Regression-UnderSampling",
    "Decision Tree-UnderSampling",
    "Random Forest-UnderSampling",
    "Bagging-UnderSampling",
    "AdaBoost-UnderSampling",
    "XGBoost-UnderSampling",
    "Bagging_UnderSampling_GS",
    "AdaBoost_UnderSampling_GS",
    "XGBoost_UnderSampling_GS",
    "Bagging_UnderSampling_RS",
    "AdaBoost_UnderSampling_RS",
    "XGBoost_UnderSampling_RS",
]
print("Training Performance Comparison:")
models_train_comp_df.T
Training Performance Comparison:
Out[171]:
Accuracy Recall Precision F1
Logistic Regression 0.904670 0.559563 0.785276 0.653478
Decision Tree 1.000000 1.000000 1.000000 1.000000
Random Forest 1.000000 1.000000 1.000000 1.000000
Bagging 0.996664 0.982514 0.996674 0.989543
AdaBoost 0.965414 0.865574 0.914550 0.889388
XGBoost 1.000000 1.000000 1.000000 1.000000
Logistic Regression-OverSampling 0.854947 0.851914 0.857113 0.854505
Decision Tree-OverSampling 1.000000 1.000000 1.000000 1.000000
Random Forest-OverSampling 1.000000 1.000000 1.000000 1.000000
Bagging-OverSampling 0.998431 0.997490 0.999371 0.998430
AdaBoost-OverSampling 0.962665 0.968207 0.957592 0.962871
XGBoost-OverSampling 1.000000 1.000000 1.000000 1.000000
Logistic Regression-UnderSampling 0.840984 0.831694 0.847439 0.839493
Decision Tree-UnderSampling 1.000000 1.000000 1.000000 1.000000
Random Forest-UnderSampling 1.000000 1.000000 1.000000 1.000000
Bagging-UnderSampling 0.994536 0.990164 0.998897 0.994512
AdaBoost-UnderSampling 0.950273 0.950820 0.949782 0.950300
XGBoost-UnderSampling 1.000000 1.000000 1.000000 1.000000
Bagging_UnderSampling_GS 0.999454 1.000000 0.998908 0.999454
AdaBoost_UnderSampling_GS 1.000000 1.000000 1.000000 1.000000
XGBoost_UnderSampling_GS 0.994536 0.997814 0.991314 0.994553
Bagging_UnderSampling_RS 1.000000 1.000000 1.000000 1.000000
AdaBoost_UnderSampling_RS 1.000000 1.000000 1.000000 1.000000
XGBoost_UnderSampling_RS 0.999454 1.000000 0.998908 0.999454
In [172]:
# Validation performance comparison

models_val_comp_df = pd.concat(
    [
        lr_val_perf.T,
        dt_val_perf.T,
        rf_val_perf.T,
        bg_val_perf.T,
        abc_val_perf.T,
        xgc_val_perf.T,
        lr_over_val_perf.T,
        dt_over_val_perf.T,
        rf_over_val_perf.T,
        bg_over_val_perf.T,
        abc_over_val_perf.T,
        xgc_over_val_perf.T,
        lr_un_val_perf.T,
        dt_un_val_perf.T,
        rf_un_val_perf.T,
        bg_un_val_perf.T,
        abc_un_val_perf.T,
        xgc_un_val_perf.T,
        bg_un_tuned_val_perf.T,
        abc_un_tuned_val_perf.T,
        xgc_un_tuned_val_perf.T,
        bg_un_tuned_rs_val_perf.T,
        abc_un_tuned_rs_val_perf.T,
        xgc_un_tuned_rs_val_perf.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Logistic Regression",
    "Decision Tree",
    "Random Forest",
    "Bagging",
    "AdaBoost",
    "XGBoost",
    "Logistic Regression-OverSampling",
    "Decision Tree-OverSampling",
    "Random Forest-OverSampling",
    "Bagging-OverSampling",
    "AdaBoost-OverSampling",
    "XGBoost-OverSampling",
    "Logistic Regression-UnderSampling",
    "Decision Tree-UnderSampling",
    "Random Forest-UnderSampling",
    "Bagging-UnderSampling",
    "AdaBoost-UnderSampling",
    "XGBoost-UnderSampling",
    "Bagging_UnderSampling_GS",
    "AdaBoost_UnderSampling_GS",
    "XGBoost_UnderSampling_GS",
    "Bagging_UnderSampling_RS",
    "AdaBoost_UnderSampling_RS",
    "XGBoost_UnderSampling_RS",
]

print("Validation Set Performance Comparison:")
models_val_comp_df.T
Validation Set Performance Comparison:
Out[172]:
Accuracy Recall Precision F1
Logistic Regression 0.902054 0.583607 0.751055 0.656827
Decision Tree 0.938389 0.783607 0.824138 0.803361
Random Forest 0.945234 0.731148 0.910204 0.810909
Bagging 0.955766 0.816393 0.898917 0.855670
AdaBoost 0.945761 0.793443 0.858156 0.824532
XGBoost 0.962612 0.859016 0.903448 0.880672
Logistic Regression-OverSampling 0.854660 0.832787 0.530271 0.647959
Decision Tree-OverSampling 0.926804 0.819672 0.748503 0.782473
Random Forest-OverSampling 0.944708 0.813115 0.837838 0.825291
Bagging-OverSampling 0.938389 0.816393 0.803226 0.809756
AdaBoost-OverSampling 0.934702 0.868852 0.759312 0.810398
XGBoost-OverSampling 0.965245 0.895082 0.889251 0.892157
Logistic Regression-UnderSampling 0.847288 0.852459 0.514851 0.641975
Decision Tree-UnderSampling 0.901527 0.881967 0.640476 0.742069
Random Forest-UnderSampling 0.923644 0.914754 0.701005 0.793741
Bagging-UnderSampling 0.925750 0.914754 0.708122 0.798283
AdaBoost-UnderSampling 0.926277 0.924590 0.706767 0.801136
XGBoost-UnderSampling 0.947341 0.947541 0.774799 0.852507
Bagging_UnderSampling_GS 0.931016 0.931148 0.720812 0.812589
AdaBoost_UnderSampling_GS 0.942601 0.940984 0.759259 0.840410
XGBoost_UnderSampling_GS 0.947341 0.947541 0.774799 0.852507
Bagging_UnderSampling_RS 0.932596 0.927869 0.727506 0.815562
AdaBoost_UnderSampling_RS 0.941022 0.931148 0.757333 0.835294
XGBoost_UnderSampling_RS 0.949447 0.944262 0.784741 0.857143
  • The XGBoost Classifier fitted on undersampled data with default parameters (XGBoost-UnderSampling) gives the best performance on the validation set, with a recall of 0.95. We will therefore go with this model. Although XGBoost_UnderSampling_GS shows the same performance on the validation set, we will stick with the simpler, untuned XGBoost-UnderSampling model.
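The selection above can be automated by ranking the comparison table on recall; a minimal sketch over a hypothetical subset of the validation results shown above:

```python
import pandas as pd

# Hypothetical subset of the validation comparison table above
val_results = pd.DataFrame(
    {"Recall": [0.859016, 0.947541, 0.947541, 0.944262]},
    index=[
        "XGBoost",
        "XGBoost-UnderSampling",
        "XGBoost_UnderSampling_GS",
        "XGBoost_UnderSampling_RS",
    ],
)

# idxmax returns the first row achieving the maximum recall
best = val_results["Recall"].idxmax()
print(best)  # → XGBoost-UnderSampling
```

On ties, `idxmax` keeps the first occurrence, which here matches the preference for the simpler untuned model.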
In [173]:
# To plot the importances of all the independent variable based on our best model
importances = xgc_un.feature_importances_
indices = np.argsort(importances)
feature_names = list(x.columns)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Total_Trans_Ct, Total_Revolving_Bal, and Total_Relationship_Count are the top three factors determining whether a credit card customer will leave the bank.

Productionizing the Best Model using Pipeline

In [174]:
xgc_un_pipe = make_pipeline(StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss"))

# Fit the model on undersampled training data
xgc_un_pipe.fit(x_train_un, y_train_un)
Out[174]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               eval_metric='logloss', gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.03, max_delta_step=0,
                               max_depth=6, min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=4, num_parallel_tree=1, predictor='auto',
                               random_state=1, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])
In [175]:
# checking model performance on the undersampled train data
xgc_un_pipe_train_perf = model_performance(xgc_un_pipe, x_train_un, y_train_un)

print("Training performance:")
xgc_un_pipe_train_perf
Training performance:
Out[175]:
Accuracy Recall Precision F1
0 0.986885 0.990164 0.983713 0.986928
In [176]:
# checking model performance on the validation data
xgc_un_pipe_val_perf = model_performance(xgc_un_pipe, x_val, y_val)

print("Validation performance:")
xgc_un_pipe_val_perf
Validation performance:
Out[176]:
Accuracy Recall Precision F1
0 0.926277 0.934426 0.703704 0.802817
  • The performance of the productionized model on the validation set is very similar to that of our best model, though with a slightly lower recall of 0.93
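For deployment, the fitted pipeline can be persisted to disk and reloaded for scoring; a minimal sketch using `joblib` (the filename is an assumption, and a LogisticRegression stands in for the XGBClassifier to keep the example self-contained):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for xgc_un_pipe (LogisticRegression avoids the xgboost dependency here)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe.fit(X, y)

# Persist the fitted pipeline, then reload it for scoring
path = os.path.join(tempfile.gettempdir(), "churn_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)
print(loaded.predict(X))
```

Because the scaler and classifier are serialized together, the reloaded object applies the exact preprocessing used at training time.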

Checking Best Model Performance on test data

In [177]:
# checking model performance on the test data
xgc_un_test_perf = model_performance(xgc_un, x_test, y_test)

print("Test Set performance:")
xgc_un_test_perf
Test Set performance:
Out[177]:
Accuracy Recall Precision F1
0 0.945498 0.97543 0.75619 0.851931
In [178]:
# creating confusion matrix of the test set
conf_mat(xgc_un, x_test, y_test)
  • The best model shows good performance on the test set, with a recall of 0.98, correctly identifying 397 out of 407 Attrited Customers
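The reported test recall can be checked directly from the confusion-matrix counts (397 attrited customers correctly identified out of 407):

```python
# Recall on the attrited (positive) class = TP / (TP + FN)
tp, fn = 397, 10  # counts from the test-set confusion matrix above
recall = tp / (tp + fn)
print(round(recall, 5))  # → 0.97543
```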

Summary of Business Recommendations and Insights

  • Credit card customers in the 40-55 age bracket account for the largest absolute number of attritions. The bank should focus its customer retention strategies on this age group.

  • Customers with Blue cards attrited the most. The bank should look into improving the customer experience with that card to enhance satisfaction with the product.

  • Customers holding just one or two of the bank's products have a higher likelihood of attriting than other customers. The bank should therefore step up its cross-selling efforts to ensure each customer is on-boarded onto at least four products, reducing the chance of attrition.

  • The more contacts a customer has with the bank, the lower the likelihood of attriting, and vice versa. The bank should therefore devise a system for reaching out to its customers at least once every two months, even if the customers do not visit the bank's premises in person. This would greatly stem the tide of attrition.

  • Total_Trans_Ct, Total_Revolving_Bal, and Total_Relationship_Count are the top three factors determining whether a credit card customer will leave the bank. The bank should monitor these parameters closely and can set up thresholds in its application as early-warning signals of attrition; for instance, the 75th percentile of these parameters among attrited customers, as depicted under 'Characteristics of Attrited Customers' above.
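The threshold idea can be sketched in pandas; the column names follow the data dictionary, but the sample values below are illustrative, not from the actual dataset:

```python
import pandas as pd

# Illustrative records for attrited customers (values are made up)
attrited = pd.DataFrame(
    {
        "Total_Trans_Ct": [40, 45, 38, 55, 42],
        "Total_Revolving_Bal": [0, 500, 800, 300, 600],
        "Total_Relationship_Count": [2, 3, 2, 1, 3],
    }
)

# 75th percentile of each key driver among attrited customers,
# usable as an early-warning threshold in the bank's application
thresholds = attrited.quantile(0.75)
print(thresholds)
```

In production, the same `quantile` call would be run on the real attrited-customer subset, and customers crossing the resulting thresholds could be flagged for retention outreach.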

In [ ]: